In [1]:
plays = read.table("../../data//hejtmy-plays.csv", sep = ",", header = T)
head(plays)
In [2]:
length(names(plays))
Do you see the weird symbols in the location column? For some reason te table looks alright when we open it in text editor but gets screwed during upload. Well, this is the time to read up on the ENCODING. As it turns out, default encoding that is set by the read.table
function is ANSI. But our file contains non ansi symbols and is encoded with utf-8. Therefore we need to fix it by recoding the column.
Unfortunatelly that is nore fdifficult when the table is already loaded and would need to be done for all columns, so let's just have a look at read.table parameters and fix it.
In [3]:
plays = read.table("../../data//hejtmy-plays.csv", sep = ",", header = T, encoding = "UTF-8")
head(plays)
Perfect. Now, looking at the number of columns, someting is out of order. we have amazing amount of NA values. Also, sometimes same players are in the first position, sometimes in the second, so it would make comparisons between players really obnoxious to do.
What would make more sense woud be to read the table into the so called long format. Therefore we would have multiple lines for the same play with different players as values in the player column. This format is quite used in SQL and with some changes it dos impede the analysis. Unfortuntely its not as easy to o a we would might like. There are some ingenious solutins that ill sho you, but tere are also packages o smooth the way
Because we are going to mess th table up, we need to save the ID of each play. Luckilly for us, its already in there. If we didn't have the ID, we would have needed to create one as simply as plays[,"id"] = 1:nrow(plays)
NEvertheless, let's continue.
Now we need to radically restructure the player part. Basically, we need each player to become one row in the table, herefore we need to split
player.X.username player.X.name player.X.startposition player.X.color player.X.score player.X.new player.X.rating player.X.win
as is often useful, let's start with a single row
In [4]:
row = plays[1, ]
We need to split players to individual rows and then "paste" the original play in front. So let's start with the splitting part. Luckily, the naming conventions are quite clear and column positions are very systematic and regular, so we can use a simple for loop to do it. Let's have a look at the situation we are at:
In [5]:
names(plays)
In [6]:
which(names(plays) == "player.1.username")
which(names(plays) == "player.2.username")
So we know there are maximum of 8 player and each player has 8 columns with information. The player information also starts at the 12th place. Let's try first step of the for loop in here
In [8]:
iPlayer_info = which(names(plays) == "player.1.username")
play_info = row[1:iPlayer_info-1]
nPlayerCol = 8
i = 1 #this will get incremented in the loop later
iStart = iPlayer_info*i
iEnd = iStart + nPlayerCol - 1
player_info = row[iStart:iEnd]
player_row = c(play_info, player_info)
player_row
In [9]:
new_df = data.frame()
iPlayer_info = which(names(plays) == "player.1.username")
play_info = row[, (1:(iPlayer_info - 1))]
nPlayerCol = 8
for (i in 1:8){
# now we want to extract information about the specific player
iStart = iPlayer_info + (i-1) * nPlayerCol
iEnd = iStart + nPlayerCol - 1
player_info = row[, iStart:iEnd]
colnames(player_info) = c("player.username" ,"player.name", "player.startposition", "player.color", "player.score",
"player.new", "player.rating", "player.win")
player_row = cbind(play_info, player_info)
new_df = rbind(new_df, player_row)
}
new_df
Looks good. Now we need to do it for every row. Because we already did it for one row, we only need to reassign the row and resave throughout.
Now this would lead to a lot of empty values. WE can remove afterwards, but we can also speed things up by breaking the forloop when empty player is located.
In [11]:
plays_recoded = data.frame()
for (i in 1:nrow(plays)){
row = plays[i,]
play_df = data.frame()
iPlayer_info = which(names(plays) == "player.1.username")
play_info = row[, (1:(iPlayer_info - 1))]
nPlayerCol = 8
for (i in 1:8){
# now we want to extract information about the specific player
iStart = iPlayer_info + (i-1) * nPlayerCol
iEnd = iStart + nPlayerCol - 1
player_info = row[, iStart:iEnd]
colnames(player_info) = c("player.username" ,"player.name", "player.startposition", "player.color", "player.score",
"player.new", "player.rating", "player.win")
if(player_info$player.name == ""){break} #CHANGE!!!!!!
player_row = cbind(play_info, player_info)
play_df = rbind(play_df, player_row)
}
plays_recoded = rbind(plays_recoded, play_df)
}
Now have a look at it
In [12]:
head(plays_recoded)
Now it works but the function looks horrible. We can fix stuff using functions to make things clearer.
Now the stat tests will be slightly more complicated, because we need to do them play.ID wise, but we should manage. There are packages to easy our ways into that.
In [ ]:
reshape(plays, direction = "long", varying = 11:75)
In [ ]:
In [ ]: